Inducing Romanization Systems
Abstract
We propose a method for inducing romanization systems directly from a bilingual alignment at the grapheme level. First, transliteration word pairs are aligned using a non-parametric Bayesian approach; then, for each grapheme sequence to be romanized, a particular romanization is selected according to a user-specified criterion. We apply our approach to the task of transliteration mining, using Levenshtein distance as the selection criterion. We performed experiments on three languages with differing characteristics: Japanese, Russian and Chinese. Our experiments show that the mining system built from the induced romanization system outperforms existing baseline romanization systems. By extending our approach to induce romanization systems based on other criteria, we expect our technique may find more general application in the future.
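The selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how Levenshtein distance is applied, so the sketch assumes that, for each source grapheme sequence, the chosen romanization is the observed target sequence with the lowest count-weighted total edit distance to all target sequences aligned with it. The grapheme alignment is assumed to be given already (the non-parametric Bayesian aligner is not reproduced), and the names `induce_romanization` and `romanize`, as well as the toy Japanese data, are hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insert, delete, substitute, each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def induce_romanization(aligned_words):
    """Build a source-grapheme -> romanization table.

    aligned_words: iterable of words, each a list of
    (source_grapheme_sequence, target_grapheme_sequence) segments.
    Assumed criterion (not confirmed by the source): pick the observed
    target that minimizes count-weighted Levenshtein distance to all
    targets aligned with the same source sequence.
    """
    observed = defaultdict(lambda: defaultdict(int))
    for word in aligned_words:
        for src, tgt in word:
            observed[src][tgt] += 1

    table = {}
    for src, counts in observed.items():
        def cost(candidate):
            return sum(n * levenshtein(candidate, t) for t, n in counts.items())
        # sorted() makes tie-breaking deterministic.
        table[src] = min(sorted(counts), key=cost)
    return table


def romanize(segments, table):
    """Romanize a pre-segmented word; unknown segments pass through unchanged."""
    return "".join(table.get(s, s) for s in segments)


# Hypothetical toy alignment of katakana segments to English graphemes.
aligned = [
    [("ト", "to"), ("ム", "m")],
    [("ト", "to"), ("マ", "ma"), ("ト", "to")],
    [("ト", "t"), ("ラ", "ra"), ("ム", "m")],
]
table = induce_romanization(aligned)
print(table["ト"], romanize(["ト", "マ", "ト"], table))
```

Swapping the `cost` function for another criterion (for example, raw frequency) would give a different induced system from the same alignment, which is the kind of extension the abstract anticipates.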
Similar resources
A Novel Method to Evaluate Romanization Systems: The Case of Romanizing Arabic Proper Nouns
The transliteration of Arabic proper nouns to other languages is usually based on the phonetic translation of these nouns into their phonetic Latin counterparts. Most of the dictionaries do not include most of these nouns, although some may have meanings. Transliteration is essential generally to Natural Language Processing (NLP) field and specifically to machine translation systems, cross-lang...
A Unified Model of Thai Romanization and Word Segmentation
Thai romanization is the way to write Thai language using roman alphabets. It could be performed on the basis of orthographic form (transliteration) or pronunciation (transcription) or both. As a result, many systems of romanization are in use. The Royal Institute has established the standard by proposing the principle of romanization on the basis of transcription. To ensure the standard, a ful...
Factored Machine Translation Systems for Russian-English
We describe the LIA machine translation systems for the Russian-English and English-Russian translation tasks. Various factored translation systems were built using MOSES to take into account the morphological complexity of Russian and we experimented with the romanization of untranslated Russian words.
Development and Testing of Transcription Software for a Southern Min Spoken Corpus
The usual challenges of transcribing spoken language are compounded for Southern Min (Taiwanese) because it lacks a generally accepted orthography. This study reports the development and testing of software tools for assisting such transcription. Three tools are compared, each representing a different type of interface with our corpus-based Southern Min lexicon (Tsay, 2007): our original Chines...
Automatic Romanization for Thai
There is a common need to romanize words in languages other than English for global communication. In particular, the romanization of proper names is unavoidable. Since there is no mutual standard, writing a Thai word in English letters is not trivial, and it is quite a labor-intensive task if it cannot be computerized. In this paper, we propose a new romanization system aiming at initi...